{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gradient-based optimization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If information about the gradient is available, it can accelerate convergence substantially." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Newton's optimization method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Newton's optimization routine aims to find a root of the gradient, i.e. a stationary point of $f$ (a candidate extremum). Since we are now focussed on a scalar function $f(\\vec{x})$, the gradient is a vector and we will need the Hessian matrix $$H = \\frac{\\partial^2 f}{{\\partial \\vec{x}}^2}, \\qquad H_{jk} = \\frac{\\partial^2 f}{\\partial x_j \\partial x_k}$$\n", "\n", "The increment is now found by solving the linear system:\n", "\n", "$$\n", "H \\Delta \\vec{x} = - \\nabla f\n", "$$\n", "\n", "Note that $H$ is symmetric whenever the second derivatives of $f$ are continuous." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Newton's optimization method has excellent convergence properties (quadratic near a solution where the Hessian is nonsingular) but requires calculation of the Hessian, which can be computationally expensive." 
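, "\n", "\n", "As a brief sketch of where the increment equation comes from (assuming $f$ is twice continuously differentiable), approximate $f$ near the current iterate by its second-order Taylor expansion and require the gradient of that model with respect to $\\Delta \\vec{x}$ to vanish:\n", "\n", "$$\n", "f(\\vec{x} + \\Delta \\vec{x}) \\approx f(\\vec{x}) + \\nabla f^T \\Delta \\vec{x} + \\tfrac{1}{2} \\Delta \\vec{x}^T H \\Delta \\vec{x}\n", "$$\n", "\n", "$$\n", "\\nabla f + H \\Delta \\vec{x} = 0 \\quad \\Rightarrow \\quad H \\Delta \\vec{x} = - \\nabla f\n", "$$\n", "\n", "Here $\\nabla f$ and $H$ are evaluated at the current iterate $\\vec{x}$." 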
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.linalg import solve\n", "import plotly.graph_objects as go\n", "\n", "def newton_method(f, grad_f, hessian_f, x0, tol=1e-6, max_iter=100):\n", "    \"\"\"Run Newton's method from x0, print each iterate, and plot the path on f.\"\"\"\n", "    x = x0\n", "    print(x)\n", "    guesses = [x]\n", "    for _ in range(max_iter):\n", "        grad = grad_f(x)\n", "        hess = hessian_f(x)\n", "\n", "        # ~~ What goes here?\n", "\n", "        ###\n", "        # Solve H dx = -grad for the Newton step (H is symmetric)\n", "        delta_x = solve(hess, -grad, assume_a='sym')\n", "        ###\n", "        x = x + delta_x\n", "        guesses.append(x)\n", "        print(x)\n", "        # grad is from the previous iterate, so the loop stops one step after it drops below tol\n", "        if np.linalg.norm(grad) < tol:\n", "            break\n", "\n", "    # Create a surface plot of the function (2D problems only)\n", "    xs = np.linspace(-2, 2, 100)\n", "    ys = np.linspace(-2, 2, 100)\n", "    X, Y = np.meshgrid(xs, ys)\n", "    Z = f([X, Y])\n", "\n", "    fig = go.Figure(data=[go.Surface(x=X, y=Y, z=Z)])\n", "\n", "    # Add markers for each guess\n", "    for guess in guesses:\n", "        fig.add_trace(go.Scatter3d(\n", "            x=[guess[0]],\n", "            y=[guess[1]],\n", "            z=[f(guess)],\n", "            mode='markers',\n", "            marker=dict(\n", "                size=5,\n", "                color='red'\n", "            )\n", "        ))\n", "\n", "    fig.update_layout(\n", "        title='Newton\\'s Method Optimization',\n", "        scene=dict(\n", "            xaxis_title='x',\n", "            yaxis_title='y',\n", "            zaxis_title='f(x,y)'\n", "        )\n", "    )\n", "    fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example: Minimize $x^4 + 2y^4$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.5 2. ]\n", "[1. 
1.33333333]\n", "[0.66666667 0.88888889]\n", "[0.44444444 0.59259259]\n", "[0.2962963 0.39506173]\n", "[0.19753086 0.26337449]\n", "[0.13168724 0.17558299]\n", "[0.0877915 0.11705533]\n", "[0.05852766 0.07803688]\n", "[0.03901844 0.05202459]\n", "[0.02601229 0.03468306]\n", "[0.01734153 0.02312204]\n", "[0.01156102 0.01541469]\n", "[0.00770735 0.01027646]\n", "[0.00513823 0.00685097]\n", "[0.00342549 0.00456732]\n", "[0.00228366 0.00304488]\n" ] }, { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def f(x):\n", "    return x[0]**4 + 2*x[1]**4\n", "\n", "def grad_f(x):\n", "    return np.array([4*x[0]**3, 8*x[1]**3])\n", "\n", "def hessian_f(x):\n", "    return np.array([[12*x[0]**2, 0], [0, 24*x[1]**2]])\n", "\n", "# Initial guess\n", "x0 = np.array([1.5, 2])\n", "\n", "# Perform Newton's method\n", "newton_method(f, grad_f, hessian_f, x0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example: Minimize $x^2 + y^2$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.5 1.5]\n", "[0. 0.]\n", "[0. 0.]\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "from scipy.linalg import solve\n", "import plotly.graph_objects as go\n", "\n", "def f(x):\n", "    return x[0]**2 + x[1]**2\n", "\n", "def grad_f(x):\n", "    return np.array([2*x[0], 2*x[1]])\n", "\n", "def hessian_f(x):\n", "    return np.array([[2, 0], [0, 2]])\n", "\n", "# Initial guess\n", "x0 = np.array([1.5, 1.5])\n", "\n", "# Perform Newton's method\n", "newton_method(f, grad_f, hessian_f, x0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why does this converge so fast?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example: $x^2 - 6xy + y^2$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.5 2. ]\n", "[-2.22044605e-16 0.00000000e+00]\n", "[0. 0.]\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def f(x):\n", "    return x[0]**2 + x[1]**2 - 6*x[0]*x[1]\n", "\n", "def grad_f(x):\n", "    return np.array([2*x[0] - 6*x[1], 2*x[1] - 6*x[0]])\n", "\n", "def hessian_f(x):\n", "    return np.array([[2, -6], [-6, 2]])\n", "\n", "# Initial guess\n", "x0 = np.array([1.5, 2.])\n", "\n", "# Perform Newton's method\n", "newton_method(f, grad_f, hessian_f, x0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yeehaw Giddyup!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient descent methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have information about the gradient (exactly or approximately), moving down the gradient is an intuitive approach to reach the minimum. While this is fairly foolproof, we saw that steepest descent can lead to *zig-zagging*, which motivated constructing orthogonal / conjugate directions that limit their interference with each other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that each step is incremented as:\n", "$$\\vec{x}^{i+1} = \\vec{x}^i+a \\vec{p}$$\n", "where $a$ is the step length and $\\vec{p}$ is the step direction.\n", "\n", "The steepest descent direction, $\\vec{p}=-\\nabla f$, maximizes the decrease in $f$ *in the immediate neighbourhood*, but a different direction may permit longer step lengths. In general:\n", "\n", "$$f(\\vec{x}^{i+1})-f(\\vec{x}^i) \\le -a \\|\\nabla f(\\vec{x}^i)\\| \\|\\vec{p}\\| \\bigg[\\cos(\\theta) -\\max_{t \\in [0,1]} \\frac{\\|\\nabla f(\\vec{x}^i+t a \\vec{p}) - \\nabla f(\\vec{x}^i) \\|}{\\|\\nabla f(\\vec{x}^i)\\|}\\bigg]$$\n", "\n", "where $\\theta$ is the angle between $\\vec{p}$ and $-\\nabla f(\\vec{x}^i)$.\n", "\n", "The second term measures how much $\\nabla f$ changes along the step. It involves the gradient at points that have not yet been visited, so an exact calculation is typically avoided. Approximations to this term (or alternative algorithmic tools) give rise to different methods." 
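, "\n", "\n", "As an illustrative special case (an assumption for this sketch, not a general result), suppose $f$ is well approximated along the step by a quadratic with Hessian $H$ and positive curvature $\\vec{p}^T H \\vec{p} > 0$. Minimizing $f(\\vec{x}^i + a \\vec{p})$ over $a$ then gives the step length in closed form:\n", "\n", "$$\n", "\\frac{d}{da} f(\\vec{x}^i + a \\vec{p}) = \\nabla f^T \\vec{p} + a \\, \\vec{p}^T H \\vec{p} = 0 \\quad \\Rightarrow \\quad a = - \\frac{\\nabla f^T \\vec{p}}{\\vec{p}^T H \\vec{p}}\n", "$$\n", "\n", "For steepest descent, $\\vec{p} = -\\nabla f$, this reduces to $a = \\nabla f^T \\nabla f / (\\nabla f^T H \\nabla f)$, with $\\nabla f$ evaluated at $\\vec{x}^i$." 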
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stochastic gradient descent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine learning training involves optimizing a model by finding an appropriate set of parameters. The parameter space can be very high dimensional, and the resulting objective function can be highly complex. In this case the gradient is typically approximated by evaluating it on randomly sampled subsets (mini-batches) of the training data. The step length $a$ is renamed the *learning rate*." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }